The dataset can be obtained from Kaggle: https://www.kaggle.com/datasets/joniarroba/noshowappointments
The dataset has 110,527 medical appointments with 14 associated features describing each appointment. The target variable is whether the patient showed up for the appointment or not.
# Importing all necessary packages
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#setting a dark grid for the plots
sns.set_style('darkgrid')
Load data.
# Loading the dataset into a pandas dataframe
appointment_df=pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
appointment_df.head()  # Reading the first 5 rows.
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
Basic Checks.
# Checking the last 5 rows
appointment_df.tail()
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110522 | 2.572134e+12 | 5651768 | F | 2016-05-03T09:15:35Z | 2016-06-07T00:00:00Z | 56 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110523 | 3.596266e+12 | 5650093 | F | 2016-05-03T07:27:33Z | 2016-06-07T00:00:00Z | 51 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110524 | 1.557663e+13 | 5630692 | F | 2016-04-27T16:03:52Z | 2016-06-07T00:00:00Z | 21 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110525 | 9.213493e+13 | 5630323 | F | 2016-04-27T15:09:23Z | 2016-06-07T00:00:00Z | 38 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110526 | 3.775115e+14 | 5629448 | F | 2016-04-27T13:30:56Z | 2016-06-07T00:00:00Z | 54 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
# Randomly sampling 5 rows of the dataset.
appointment_df.sample(5)
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91607 | 4.254279e+11 | 5739705 | F | 2016-05-25T13:59:19Z | 2016-06-01T00:00:00Z | 49 | ANDORINHAS | 0 | 1 | 1 | 0 | 0 | 1 | No |
| 104385 | 9.742971e+13 | 5767594 | F | 2016-06-02T18:50:11Z | 2016-06-08T00:00:00Z | 60 | ROMÃO | 0 | 1 | 0 | 0 | 0 | 1 | Yes |
| 48902 | 9.264355e+12 | 5568058 | F | 2016-04-11T13:48:35Z | 2016-05-09T00:00:00Z | 36 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 93595 | 8.244175e+11 | 5777407 | F | 2016-06-06T13:44:22Z | 2016-06-06T00:00:00Z | 25 | SANTO ANDRÉ | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 5455 | 8.663215e+13 | 5638115 | F | 2016-04-29T07:43:42Z | 2016-05-03T00:00:00Z | 53 | TABUAZEIRO | 1 | 1 | 0 | 0 | 0 | 1 | Yes |
# Checking the number of rows and columns in the dataset.
print('We have {} rows and {} columns in the dataframe.'.format(appointment_df.shape[0],appointment_df.shape[1]))
We have 110527 rows and 14 columns in the dataframe.
# General information about the dataset
appointment_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  int64
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  int64
 8   Hipertension    110527 non-null  int64
 9   Diabetes        110527 non-null  int64
 10  Alcoholism      110527 non-null  int64
 11  Handcap         110527 non-null  int64
 12  SMS_received    110527 non-null  int64
 13  No-show         110527 non-null  object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
# Statistical summary of the numerical columns
appointment_df.describe()
| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
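The summary above shows a minimum Age of -1 and a maximum Handcap of 4, while the other indicator columns only take the values 0 and 1. A minimal sketch of how such values could be flagged programmatically is given below. The toy rows and the `bounds` dictionary are assumptions made for illustration, not part of the original notebook or the data dictionary.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [62, -1, 8, 115],
    "Handcap": [0, 0, 4, 1],
})

# Hypothetical plausibility bounds; adjust them to the actual data dictionary.
bounds = {"Age": (0, 120), "Handcap": (0, 1)}

# Count how many values fall outside each column's expected range.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Whether a Handcap value above 1 is an error or a count of disabilities is a question for the data documentation; the check only surfaces the rows worth inspecting.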
# Description of the categorical columns
appointment_df.describe(include='O')
| | Gender | ScheduledDay | AppointmentDay | Neighbourhood | No-show |
|---|---|---|---|---|---|
| count | 110527 | 110527 | 110527 | 110527 | 110527 |
| unique | 2 | 103549 | 27 | 81 | 2 |
| top | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | JARDIM CAMBURI | No |
| freq | 71840 | 24 | 4692 | 7717 | 88208 |
# Checking missing values
appointment_df.isnull().sum()
PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64
# Checking duplicates
appointment_df.duplicated().sum()
0
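From the basic checks above, the dataset on medical appointments has 110,527 rows and 14 columns, with no missing values and no duplicated rows.
In this section we clean the following: the identifier columns, the misspelled Hipertension column, the two date columns, and the negative value in the Age column.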
# Creating a copy to work with
app_df=appointment_df.copy()
app_df.head(1)
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
# Dropping the unnecessary identifier columns
dropcols=['PatientId','AppointmentID']
app_df.drop(columns=dropcols, inplace=True)
app_df.head(1)
| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
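# Renaming the Hipertension column to Hypertension
# Note: rename(..., inplace=True) returns None, so assigning its result also creates
# an all-None 'Hipertension' column; that extra column is dropped further down.
app_df['Hipertension']=app_df.rename(columns={'Hipertension':'Hypertension'},inplace=True)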
app_df.head(1)
| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | Hipertension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | None |
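The trailing `Hipertension` column filled with `None` in the output above appears because `rename(..., inplace=True)` returns `None`. A minimal sketch of a rename that avoids the extra column is shown below on a toy frame; the frame `toy` is an assumption made for illustration.

```python
import pandas as pd

# Toy frame with the misspelled column name (values are illustrative only).
toy = pd.DataFrame({"Hipertension": [1, 0], "Diabetes": [0, 1]})

# rename returns a new frame; reassign the whole frame rather than one column.
toy = toy.rename(columns={"Hipertension": "Hypertension"})

print(list(toy.columns))
```

Reassigning the result keeps the operation explicit and leaves no leftover column to drop later.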
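# Parsing the dates in both the ScheduledDay and AppointmentDay columns
# Caution: the second line parses ScheduledDay again, so AppointmentDay becomes a copy of
# ScheduledDay and every App* column derived from it mirrors its Sch* counterpart.
# To parse the real appointment dates, pass app_df.AppointmentDay instead.
app_df['ScheduledDay']=pd.to_datetime(app_df.ScheduledDay)
app_df['AppointmentDay']=pd.to_datetime(app_df.ScheduledDay)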
app_df.head(1)
| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | Hipertension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 18:38:08+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | None |
app_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   Gender          110527 non-null  object
 1   ScheduledDay    110527 non-null  datetime64[ns, UTC]
 2   AppointmentDay  110527 non-null  datetime64[ns, UTC]
 3   Age             110527 non-null  int64
 4   Neighbourhood   110527 non-null  object
 5   Scholarship     110527 non-null  int64
 6   Hypertension    110527 non-null  int64
 7   Diabetes        110527 non-null  int64
 8   Alcoholism      110527 non-null  int64
 9   Handcap         110527 non-null  int64
 10  SMS_received    110527 non-null  int64
 11  No-show         110527 non-null  object
 12  Hipertension    0 non-null       object
dtypes: datetime64[ns, UTC](2), int64(7), object(4)
memory usage: 11.0+ MB
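Each date column should be parsed from its own source column, otherwise the two end up identical. The sketch below shows one way to do that on a toy frame. The two rows reuse timestamps from the first and last rows of the raw data shown earlier, and the frame `toy` and the variable `same` are names introduced here for illustration.

```python
import pandas as pd

# Two rows with timestamps taken from the raw head() and tail() output.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-03T09:15:35Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-06-07T00:00:00Z"],
})

# Parse each column from itself so neither overwrites the other.
for col in ["ScheduledDay", "AppointmentDay"]:
    toy[col] = pd.to_datetime(toy[col])

# Sanity check: the two columns should not be identical after parsing.
same = bool((toy["ScheduledDay"] == toy["AppointmentDay"]).all())
print(same)
```

A quick equality check like this is cheap and catches an accidental copy before any time components are derived.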
# Adding more columns for the time components of the scheduled and appointment days parsed in the cell above.
# Scheduled day new columns
app_df['SchYear']=app_df.ScheduledDay.dt.year
app_df['SchMonth']=app_df.ScheduledDay.dt.month
app_df['SchDay']=app_df.ScheduledDay.dt.day
app_df['SchHour']=app_df.ScheduledDay.dt.hour
app_df['SchWeekday']=app_df.ScheduledDay.dt.dayofweek
# Appointment day new columns
app_df['AppYear']=app_df.AppointmentDay.dt.year
app_df['AppMonth']=app_df.AppointmentDay.dt.month
app_df['AppDay']=app_df.AppointmentDay.dt.day
app_df['AppHour']=app_df.AppointmentDay.dt.hour
app_df['AppWeekday']=app_df.AppointmentDay.dt.weekday
# Confirming the new columns added
pd.set_option('display.max_columns',None)
app_df.sample(5)
| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | Hipertension | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 107560 | F | 2016-06-01 14:52:32+00:00 | 2016-06-01 14:52:32+00:00 | 80 | REDENÇÃO | 0 | 1 | 1 | 0 | 0 | 0 | No | None | 2016 | 6 | 1 | 14 | 2 | 2016 | 6 | 1 | 14 | 2 |
| 79187 | M | 2016-05-20 09:41:09+00:00 | 2016-05-20 09:41:09+00:00 | 86 | SÃO CRISTÓVÃO | 0 | 1 | 0 | 0 | 0 | 1 | No | None | 2016 | 5 | 20 | 9 | 4 | 2016 | 5 | 20 | 9 | 4 |
| 15657 | M | 2016-04-27 10:59:23+00:00 | 2016-04-27 10:59:23+00:00 | 29 | ILHA DO PRÍNCIPE | 0 | 0 | 0 | 0 | 0 | 0 | No | None | 2016 | 4 | 27 | 10 | 2 | 2016 | 4 | 27 | 10 | 2 |
| 30683 | F | 2016-05-04 09:58:24+00:00 | 2016-05-04 09:58:24+00:00 | 38 | CENTRO | 0 | 0 | 0 | 0 | 0 | 1 | No | None | 2016 | 5 | 4 | 9 | 2 | 2016 | 5 | 4 | 9 | 2 |
| 44801 | F | 2016-05-12 12:47:44+00:00 | 2016-05-12 12:47:44+00:00 | 55 | INHANGUETÁ | 0 | 0 | 0 | 0 | 1 | 0 | No | None | 2016 | 5 | 12 | 12 | 3 | 2016 | 5 | 12 | 12 | 3 |
# Dropping the unnecessary columns
unwanted_cols=['ScheduledDay','AppointmentDay','Hipertension']
app_df.drop(columns=unwanted_cols,inplace=True)
app_df.head(1)
| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | 2016 | 4 | 29 | 18 | 4 | 2016 | 4 | 29 | 18 | 4 |
# Cleaning the age
# First we check where the age is less than 0, that is -1
app_df[app_df['Age']<0]
| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99832 | F | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No | 2016 | 6 | 6 | 8 | 0 | 2016 | 6 | 6 | 8 | 0 |
# We replace this value with the most frequently occurring age.
app_df.Age.mode()
# The mode is 0, so we replace the negative value with 0.
app_df.loc[app_df['Age']<0,'Age']=0
# Confirming that the value has been replaced (an empty result is expected)
app_df.loc[app_df['Age']<0]
| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Calculating the current mean after replacing the value with 0
np.mean(app_df.Age)
37.08888325929411
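The replacement above hard-codes 0 after reading the mode by eye. A minimal sketch of the same step with the mode computed in code is shown below; the toy frame and the names `valid_age` and `most_common_age` are assumptions made for illustration.

```python
import pandas as pd

# Toy ages including one invalid negative value (illustrative only).
toy = pd.DataFrame({"Age": [0, 0, -1, 62, 56]})

# Compute the mode from valid ages only so the bad value cannot influence it.
valid_age = toy.loc[toy["Age"] >= 0, "Age"]
most_common_age = int(valid_age.mode().iloc[0])

# Overwrite only the invalid rows with the most common valid age.
toy.loc[toy["Age"] < 0, "Age"] = most_common_age
print(toy["Age"].tolist())
```

`mode()` returns a Series because ties are possible, which is why the first entry is selected explicitly.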
# We can now save our cleaned dataframe to a CSV file called Newappoint.csv
app_df.to_csv('Newappoint.csv',index=False)
In this section we will compute statistics and create visualizations as we seek to answer the questions posed in the introduction.
# Load the Newappoint.csv file
data=pd.read_csv('Newappoint.csv')
data.head(1)
| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | 2016 | 4 | 29 | 18 | 4 | 2016 | 4 | 29 | 18 | 4 |
# What categories are in the Gender column, and in what proportions?
gender_pct=data.Gender.value_counts(normalize=True)
# Indexing by label ('F', 'M') rather than by position keeps the lookup explicit
print('Females are {:.0%} while males are {:.0%} of the dataset.'.format(gender_pct['F'],gender_pct['M']))
Females are 65% while males are 35% of the dataset.
def Get_barplot(label_0, height_0, label_1, height_1, col_0='g', col_1='r', Title='Title', yLabel='Value_count', xLabel=''):
    """Plot two bars on a single chart for comparison purposes.
    The first bar takes label_0, height_0 and col_0 (green by default).
    The second bar takes label_1, height_1 and col_1 (red by default).
    """
    fig, ax = plt.subplots()
    ax.bar(label_0, height_0, label=label_0, color=col_0)
    ax.bar(label_1, height_1, label=label_1, color=col_1)
    ax.set_title(Title, fontweight='bold')
    ax.set_ylabel(yLabel)
    ax.set_xlabel(xLabel)
    ax.legend()

# Computing the gender counts once and reusing them for both bars
gender_counts=data.Gender.value_counts()
Get_barplot(label_0=gender_counts.index[0], height_0=gender_counts.values[0],
            label_1=gender_counts.index[1], height_1=gender_counts.values[1],
            Title='Gender counts (Male and Female)', xLabel='Gender: Female (F) OR Male (M)')
# Plotting histogram to visualize the distribution based on show up status.
fig=px.histogram(data,x='Gender',title='Gender distribution',color='No-show',barmode='group',marginal='box')
fig.show()
# Plotting histogram to visualize the percentage distribution of gender based on the show-up column.
fig=px.histogram(data,x='Gender',title='Percentage of Gender distribution',color='No-show',barmode='group',barnorm='percent')
fig.show()
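The percentages behind a chart like this can also be read off a table. The sketch below uses `pd.crosstab` with row-wise normalization to get the show-up and no-show share within each gender. The toy rows are made up for illustration and do not reproduce the real proportions.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each gender's row sum to 1; scale to percentages.
rates = pd.crosstab(toy["Gender"], toy["No-show"], normalize="index") * 100
print(rates.round(1))
```

Normalizing within each gender matters here because the groups differ in size, so raw counts alone would overstate the larger group.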
# What years did the scheduling take place?
data.SchYear.value_counts(normalize=True)*100
2016    99.943905
2015     0.056095
Name: SchYear, dtype: float64
# In what years did the appointments take place?
data.AppYear.value_counts()
2016    110465
2015        62
Name: AppYear, dtype: int64
data.head(1)
| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | SchYear | SchMonth | SchDay | SchHour | SchWeekday | AppYear | AppMonth | AppDay | AppHour | AppWeekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | 2016 | 4 | 29 | 18 | 4 | 2016 | 4 | 29 | 18 | 4 |
# Creating a dataframe with the appointment time columns and then creating their subplots.
Apptime_df=data.iloc[:,16:]
plt.figure(figsize=(14,10),facecolor='white')
fignumber=1
for col in Apptime_df:
    if fignumber<=6:
        ax=plt.subplot(3,2,fignumber)
        sns.histplot(Apptime_df[col])
        plt.xlabel(col,fontsize=20)
        plt.ylabel('Value_Count',fontsize=20)
        plt.title('{} Distribution'.format(col))
    fignumber+=1
plt.tight_layout()
# Creating a dataframe with the scheduling time columns and then creating their subplots.
Schtime_df=data.iloc[:,11:15]
plt.figure(figsize=(14,10),facecolor='white')
fignumber=1
for col in Schtime_df:
    if fignumber<=6:
        ax=plt.subplot(3,2,fignumber)
        sns.histplot(Schtime_df[col])
        plt.xlabel(col,fontsize=20)
        plt.ylabel('Value_Count',fontsize=20)
        plt.title('{} Distribution'.format(col))
    fignumber+=1
plt.tight_layout()
# Duration (Months) covered in the entire dataset from 2015 to 2016
data.groupby('AppYear').AppMonth.value_counts()
AppYear AppMonth
2015 12 61
11 1
2016 5 67421
4 25339
6 13750
3 3614
2 281
1 60
Name: AppMonth, dtype: int64
# Weekdays covered in the dataset, to confirm from the subplots above whether day 6 (Sunday) is present.
data.AppWeekday.value_counts()
1    26168
2    24262
0    23085
4    18915
3    18073
5       24
Name: AppWeekday, dtype: int64
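The weekday codes follow the pandas convention in which 0 is Monday and 6 is Sunday. The sketch below maps the codes to names and lists any weekday that never occurs. The counts are copied from the output above, while the `names` mapping and the variable names are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
weekday_counts = pd.Series({1: 26168, 2: 24262, 0: 23085, 4: 18915, 3: 18073, 5: 24})

# pandas dt.dayofweek / dt.weekday convention: Monday=0 ... Sunday=6.
names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
         4: "Friday", 5: "Saturday", 6: "Sunday"}

# Relabel the index with weekday names for easier reading.
named = weekday_counts.rename(index=names)

# Weekdays that do not appear in the data at all.
missing = [name for day, name in names.items() if day not in weekday_counts.index]
print(named)
print(missing)
```

Named labels make it clear that the rare code 5 is Saturday and that code 6 is absent.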
# Plotting the scheduled year to see which months had the most scheduled appointments
fig=px.histogram(data,x='SchYear',color='SchMonth',barmode='group',title='Distribution of scheduled months for both 2016 and 2015',
                 labels={'SchYear':'Scheduled Year','SchMonth':'Scheduled Month'})
fig.show()
# Plotting the scheduled month to see which weekdays had the most scheduled appointments
fig=px.histogram(data,x='SchMonth',color='SchWeekday',barmode='group',title='Distribution of scheduled days monthwise',
                 labels={'SchWeekday':'Scheduled Weekday','SchMonth':'Scheduled Month'})
fig.show()
# Plotting the appointment year to see which months had the most appointments
fig=px.histogram(data,x='AppYear',color='AppMonth',barmode='group',title='Distribution of appointment months for both 2016 and 2015',
                 labels={'AppYear':'Appointment Year','AppMonth':'Appointment Month'})
fig.show()
# Plotting the appointment month to see which weekdays had the most appointments
fig=px.histogram(data,x='AppMonth',color='AppWeekday',barmode='group',title='Distribution of appointment days monthwise',
                 labels={'AppWeekday':'Appointment Weekday','AppMonth':'Appointment Month'})
fig.show()
# Creating a bar plot to see the average appointment hour for each weekday.
plt.figure(figsize=(14,7))
sns.barplot(data=data,y='AppHour',x='AppWeekday');
data.Age.value_counts().head()
0     3540
1     2273
52    1746
49    1652
53    1651
Name: Age, dtype: int64
# Visualizing the Age distribution in the dataset by gender
fig=px.histogram(data,x='Age',title='Distribution of age.',color='Gender',marginal='box')
fig.show()
# Visualizing the Age distribution in the dataset by show-up status
plt.figure(figsize=(14,7))
sns.histplot(x=data['Age'],hue=data['No-show'],kde=True)
plt.title('Age distribution')
plt.yticks(rotation=90);
# Creating a dataframe with different conditions such as diseases and habits and then creating their subplots.
condition_df=data.loc[:,['Scholarship','Hypertension','Diabetes','Alcoholism','Handcap','SMS_received']]
plt.figure(figsize=(14,10),facecolor='white')
fignumber=1
for col in condition_df:
    if fignumber<=6:
        ax=plt.subplot(3,2,fignumber)
        sns.barplot(x=condition_df[col],y=data['Age'])
        plt.xlabel(col,fontsize=20)
        plt.ylabel('Age',fontsize=20)
        plt.title('Age against {}'.format(col))
    fignumber+=1
plt.tight_layout()
# Getting the number of unique values in the Neighbourhood column
data.Neighbourhood.nunique()
81
# Getting the number of appointments per neighbourhood
data.Neighbourhood.value_counts()
JARDIM CAMBURI 7717
MARIA ORTIZ 5805
RESISTÊNCIA 4431
JARDIM DA PENHA 3877
ITARARÉ 3514
...
ILHA DO BOI 35
ILHA DO FRADE 10
AEROPORTO 8
ILHAS OCEÂNICAS DE TRINDADE 2
PARQUE INDUSTRIAL 1
Name: Neighbourhood, Length: 81, dtype: int64
plt.figure(figsize=(16,8))
sns.countplot(data=data,x='Neighbourhood',hue='No-show',order=data.Neighbourhood.value_counts().index)
plt.xticks(rotation=90)
plt.title('Appointments per Neighbourhood');
# To check the distribution of scholarship
# Assuming 0 means no scholarship and 1 means has scholarship.
data.Scholarship.value_counts()/len(data)*100
0    90.173442
1     9.826558
Name: Scholarship, dtype: float64
# Plotting histogram to show the distribution in comparison to gender
fig=px.histogram(data,x='Scholarship',color='Gender',barmode='group',title=
'Percentage distribution of Scholarship Gender-wise',barnorm='percent')
fig.show()
# Plotting histogram to show Scholarship distribution according to show up.
fig=px.histogram(data,x='Scholarship',color='No-show',barmode='group',title=
'Percentage distribution of Scholarship',barnorm='percent')
fig.show()
data.Hypertension.value_counts()/len(data)*100
0    80.275408
1    19.724592
Name: Hypertension, dtype: float64
# Plotting histogram to show Hypertension distribution gender-wise
fig=px.histogram(data,x='Hypertension',color='Gender',barmode='group',title=
'Percentage distribution of people with Hypertension',barnorm='percent')
fig.show()
# Plotting histogram to show Hypertension distribution based on show-up status
fig=px.histogram(data,x='Hypertension',color='No-show',barmode='group',title=
'Percentage distribution of people with Hypertension',barnorm='percent')
fig.show()
Diabetes
data.Diabetes.value_counts()/len(data)*100
0    92.813521
1     7.186479
Name: Diabetes, dtype: float64
# Plotting histogram to show Diabetes distribution gender-wise
fig=px.histogram(data,x='Diabetes',color='Gender',barmode='group',title=
'Percentage distribution of people with Diabetes',barnorm='percent')
fig.show()
# Plotting histogram to show Diabetes distribution based on scholarship status
fig=px.histogram(data,x='Diabetes',color='Scholarship',barmode='group',title=
'Percentage distribution of people with Diabetes',barnorm='percent')
fig.show()
# Plotting histogram to show Diabetes distribution based on show-up status
fig=px.histogram(data,x='Diabetes',color='No-show',barmode='group',title=
'Percentage distribution of Diabetes',barnorm='percent')
fig.show()
# Counting the number of people per class (alcoholic or non-alcoholic)
data.Alcoholism.value_counts()
0    107167
1      3360
Name: Alcoholism, dtype: int64
# Plotting histogram to show Alcoholism distribution based on scholarship status
fig=px.histogram(data,x='Alcoholism',color='Scholarship',barmode='group',title=
'Percentage distribution of Alcoholism',barnorm='percent')
fig.show()
# Plotting histogram to show Alcoholism distribution based on show-up status
fig=px.histogram(data,x='Alcoholism',color='No-show',barmode='group',title=
'Percentage distribution of Alcoholism',barnorm='percent')
fig.show()
# Plotting histogram to show Alcoholism distribution gender-wise
fig=px.histogram(data,x='Alcoholism',color='Gender',barmode='group',title=
'Percentage distribution of Alcoholism',barnorm='percent')
fig.show()
data.SMS_received.value_counts()/len(data)*100
0    67.897437
1    32.102563
Name: SMS_received, dtype: float64
# Plotting histogram to show SMS received distribution based on show-up status
fig=px.histogram(data,x='SMS_received',color='No-show',barmode='group',title=
'Percentage distribution of SMS received',barnorm='percent')
fig.show()
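To compare how strongly each binary feature relates to showing up, the no-show rate can be computed per feature value instead of being read from separate charts. The sketch below does this with a groupby over a boolean "missed" flag. The toy rows are made up for illustration and the names `missed` and `rates` are introduced here, so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Scholarship":  [0, 0, 0, 1, 1, 0, 0, 1],
    "SMS_received": [0, 1, 0, 1, 0, 0, 1, 1],
    "No-show":      ["No", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# 1 where the patient missed the appointment, 0 otherwise.
missed = toy["No-show"].eq("Yes").astype(int)

# No-show rate (in percent) for each value of each binary feature.
rates = {
    col: missed.groupby(toy[col]).mean().mul(100).round(1).to_dict()
    for col in ["Scholarship", "SMS_received"]
}
print(rates)
```

Putting the rates side by side makes it easier to judge whether a feature's two groups actually differ before drawing a conclusion from the plots.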
Although more females than males appear in the dataset, the show-up trend looks the same for both genders: about 80% showed up and about 20% did not.
Many appointments were scheduled for ages 0 to 1. This could be because young children have programs such as vaccinations that require regular visits. There also seem to be more appointments at around age 50, especially among females, perhaps because people become more prone to disease as they age due to factors such as hormonal changes.
We also found more females than males with lifestyle diseases such as diabetes and hypertension, and those who suffer from these diseases appear more likely to show up for their appointments.
More men than women take alcohol. Those who take alcohol hold more scholarships, perhaps because they are more exposed to the poverty issues that Bolsa Familia addresses.
Features such as scholarship and SMS received did not appear to contribute significantly to the show-up outcome.
The only limitation I encountered is the lack of proper demographics for the population in this dataset. For instance, we could have performed a better analysis with location data, comparing whether living close to the facility (Neighbourhood) affects showing up. Knowing whether the client resides in a city or a rural area, combined with whether the facility's neighbourhood is urban or rural, would also have helped us understand how residence affects showing up. Literacy level would likewise have greatly improved the analysis.